HW 9

Author

Ella Addink

Exercise 1

  1. Read chapter 4 of of Wilke (2019). List two or three things you learned about using color from this chapter.

    • When using color to distinguish between distinct items or groups that have no order, no one color should stand out relative to the others.

    • Diverging color scales are a good choice when we want to visualize the deviation of data values from a neutral midpoint in one of two directions. Usually we have a very light color as the midpoint and progressively darker colors going both to the right and left.

    • When using accent colors, it is very important that baseline colors do not distract and take the viewer’s attention away from the data meant to be accented. A simple way to make sure this is not a problem is to just remove all color from everything except for the data one wishes to highlight.

  2. Read chapter 5 of of Wilke (2019). Chapter 5 list a wide range of graphic types. What is the organization scheme Wilke uses to group them? What is the most interesting thing you learned from this chapter?

    • Wilke groups the graphic types by what the graphics are meant to visualize. Graphics can visualize amounts, distributions, proportions, x-y relationships, geospatial data, and/or uncertainty.

    • The most interesting thing I learned was when you can use mosaic plots, treemaps, and parallel sets. Treemaps work well when the subdivisions for each group are different from the subdivisions of others, whereas mosaic plots assume the subdivisions are the same across groups. Parallel sets work well if there are more than two grouping variables.

  3. By now, you should be able to make quite a few of the graphics shown in chapter 5. Are there some you would not know how to make (yet) based on what we have done in class? List several that you don’t know how to make – choose the ones you would be most interested in learning how to make and say why you are interested.

    • I am not sure how to make the: Q-Q Plot, Violin Plot, Strip Chart, Sina Plot, Mosaic Plot, Treemap, Parallel Sets, Density Contours, 2D Bins, Hex Bins, Cartogram Heatmap, or Confidence Strips

    • I would be most interested in learning how to make:

      • The violin plot - Violin plots look like an interesting alternative to a boxplot and it would be interesting to learn more ways to show distributions.

      • The density contours - These look like an intriguing way to show the distribution of data points when you have too many overlapping when simply kept as points.

      • The hex bins - Similar to the density contours, I can see how 2D bins or hex bins would be helpful when you have too many points on your scatterplot but would like to keep the sense of a scatterplot (rather than switching to a bar graph or something else) and I think the hex shapes are visually appealing.

  4. See if you can figure out how to make at least one of the types of graphic listed in part c. You may need to do some experimenting or to consult the documentation or example galleries (for Vega-Lite, vegabrite, altiar, Altair, etc.).

    • I decided to make the hex bins, however vega-lite does not seem to have a hexagon shape choice, so I switched to doing 2D bins instead. I tried to recreate Wilke’s example of a 2D histogram in Ch. 18 (Figure 18.6).
nyc_flights <- nycflights13::flights |> 
  select(flight, dep_time, dep_delay) |> 
  as.data.table()  # Trying to make it take less memory because rendering this is rough..
vl_chart() |> 
  vl_mark_rect() |> 
  vl_encode_x("dep_time:Q", bin = list(maxbins = 60), title = "departure time") |>   
  vl_encode_y("dep_delay:Q", bin = list(maxbins = 40), title = "departure delay (min)") |> 
  vl_axis_y(values = list(0, 300, 600, 900)) |> 
  vl_axis_x(values = list(0, 600, 1200, 1800, 2400), 
             format = "~r") |> 
  vl_encode_color("flight:Q", title = "departures",
                   legend = list(gradientLength = 70), 
                   scale = list(scheme = "purples")) |> 
  vl_aggregate_color("count") |> 
  vl_add_properties(width = 300, height = 200) |> 
  vl_add_data(nyc_flights)

I cannot seem to figure out how to properly format the times on the x axis. If I could turn the values into datetime values then I could probably figure out the d3 formatting to make the times look normal. However, these data values are just numbers stringed together so I am not sure how to use datetime to extract the hour and minutes.

  1. Chapters 6 - 16 are organized by graphic types. You have already read some of this material. Pick one chapter that is related to your graphic in part d and give some examples of how what you learned in that chapter affected how you created your graphic.

    • Ch. 18 is related to my graphic, so I read this chapter. Wilke explains in this chapter how 2D histograms can be a helpful alternative to scatterplots when the dataset is very large and many points overlap. Wilke explains that to create them, you must bin the data in two dimensions, subdivide the x-y plane into small rectangles, and then color the rectangles by the count of observations that fall into each one. Therefore, I used a “rect” mark to create rectangles, added bins to both the x and y encoding, and aggregated color so that it represented the count of flights for each departure time and delay amount.

Exercise 2

  1. Create side-by-side pie charts with good labeling and appropriate choice of colors. Some more things to consider: Should you present using counts or percentages? How will you deal with the nonresponders?
survey_data <- read.csv(file = "https://calvin-data304.netlify.app/data/likert-survey.csv", 
                        header = TRUE)
filtered_survey_data <- survey_data |> 
  filter(response != "no reponse") |> # removed no response counts
  group_by(year) |>
  mutate(total = sum(count)) |>
  ungroup() |>
  mutate(percent = count / total)
  #mutate(percent = round(percent, 2))
pies <- vl_chart() |>
  vl_mark_arc(outerRadius = 100) |> 
  vl_encode_theta("percent:Q", stack = TRUE) |> 
  vl_encode_color("response:N", legend = FALSE) |> 
  vl_encode_order("response:N")
Warning: Invalid schema for object passed to or created by
modify_inner_spec.vegaspec_unit
value_labels <- vl_chart() |>
  vl_mark_text(radius = 70) |> 
  vl_encode_theta("percent:Q", stack = TRUE) |> 
  vl_encode_text("percent:Q", format = ".2%") |> 
  vl_encode_color(value = "white") |> 
  vl_encode_order("response:N")

group_labels <- vl_chart() |>
  vl_mark_text(radius = 140) |> 
  vl_encode_theta("percent:Q", stack = TRUE) |> 
  vl_encode_text("response:N") |> 
  vl_encode_color(value = "black") |>
  vl_encode_order("response:N")

vl_layer(pies, value_labels, group_labels) |> 
  vl_facet_column("year:N", title = "") |> 
  vl_add_properties(width = 400, height = 200, title = "Survey Results") |>
  vl_add_data(filtered_survey_data) 
  1. Create some other kind of graphic to visualize the same data.
vl_chart() |> 
  vl_mark_bar() |>
  vl_encode_x("percent:Q", title = "") |> 
  vl_encode_y("year:N", title = "") |>
  vl_encode_color("response:N", title = "") |>
  vl_encode_order("response:N", sort = c("strongly disagree", "disagree", 
                                         "neither agree nor disagree", "agree", 
                                         "strongly agree")) |>
  vl_axis_x(format = ".2%") |> 
  vl_add_properties(width = 400, height = 200, title = "Survey Results") |>
  vl_add_data(filtered_survey_data) 
Warning: Invalid schema for object passed to or created by
modify_inner_spec.vegaspec_unit
  1. Comment on the pros and cons of your two (or more) graphics. Which would you recommend (for what purposes)?

    • Pie chart - I think the pie charts are nice for viewing each response group as part of the whole. I like that each wedge is labeled with its actual value. Some cons with the current design though is that there is not an order to the wedges and it would probably make sense for them to go from strongly disagree to strongly agree. I, however, could not figure out how to get it to order like this. In general, the pie charts are not great for comparing between the two years because it is very hard to perceive angle.
    • Stacked bar chart - I think the stacked bars make it easier to compare the two years. However, I struggled to get labels on the bars and without any, it is hard to tell what the actual percentages are for each response group. I also would have preferred to order the response groups from strongly disagree to strongly agree (right to left), but I again could not figure this out.
    • Overall, I would recommend the pie chart if you are more interested in comparing the responses within one year and the stacked bars for comparing the response percents between years.

Exercise 3

Create a choropleth map of the United States (by county or by state) or the worlds (by country) using any data you like. See the map slides for some possibilities, but feel free to get creative with your data.

us_map_url <- "https://cdn.jsdelivr.net/npm/vega-datasets@2.11.0/data/us-10m.json"

airports <- read.csv("https://cdn.jsdelivr.net/npm/vega-datasets@2.11.0/data/airports.csv")
state_map <-
  vl_chart() |>
  vl_add_data(
    url = us_map_url, 
    format = list(type = "topojson", feature = "states")) |>
  vl_add_properties(projection = list(type = "albersUsa")) |>
  vl_mark_geoshape(fill = "transparent", stroke = "steelblue") 

state_map |> vl_add_properties(width = 500, height = 300)

I was going to try to do Exercise 2 from the slides, that is making a map of the number of airports in each state. I got stuck on how to go about joining the airport data with the state map data. The state data does not seem to have state names or abbreviations, so do we use id? But then what does id match to in the airport data? Once joined, an aggregate would probably be involved to sum the airports in each state.